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Determining what constitutes unusually frequent and rare in a genome is a fun- 
damental and ongoing issue in genomics [6|. Sequence motifs may be frequent 
because they appear in mobile, structural or regulatory elements. It has been sug- 
gested that some recurrent sequence motifs indicate hitherto unknown or poorly 
understood biological phenomena ifTTI . We propose that the distribution of DNA 
words in genomic sequences can be primarily characterized by a double Pareto- 
lognormal distribution, which explains lognormal and power-law features found 
across all known genomes. Such a distribution may be the result of completely 
random sequence evolution by duplication processes. The parametrization of ge- 
nomic word frequencies allows for an assessment of significance for frequent or 
rare sequence motifs. 

The simplest type of sequence motif is a DNA word of a fixed length I, called 
an t-mer. The number of occurrences of a word w is denoted by N(w), and the 
distribution of N(w) across all £-mers is called the word frequency distribution or 
spectrum. The statistical significance of a word's unusual abundance (or rarity) is 
assessed by referring to a null model of random sequences. Standard null mod- 
els include sequences of independent and identically distributed letters (Bernoulli 
model) and low-order Markov models ifToll. In such random text models, over- 
and underrepresentation of short words (typically, 8 < £ < 16) are evaluated by 
using Poisson or Gaussian approximations, implying a rapidly decreasing tail in 
the spectrum. It is also customary to use localized random shuffling of the stud- 

*Department of Computer Science and Operations Research, Universite de Montreal, CP 6128, 
succ. Centre- Ville, Montreal, Quebec H3C 3J7, Canada. 

tLaboratoire d'Informatique Fondamentale de Lille, Bat. M3, 59655 Villeneuve d' Ascq Cedex, 
France. 



1 



words 




1 10 100 1000 



occurrence 

Figure 1: 13-mer frequencies in repeat-masked human chromosome 5, and in 
random sequences of same length. 

ied genome sequence in order to preserve large-scale compositional heterogeneity. 
Empirical word frequency distributions in shuffled sequences also have a light tail. 

In reality, genomic word frequency distributions have a prominent heavy tail, 
which is not captured by random text models (Fig.[T]). Depending on the genome 
length L and its relative size with respect to the number of £-mers A 1 , the spectrum 
may show a power-law decrease on the left or right, or have a lognormal shape. 
The heavy tail on the right cannot be entirely attributed to mobile elements, as it 
is present even in repeat-masked vertebrate and many smaller genomes (Figures El 
and Consequently, random text models and shuffling tend to underestimate the 
probability of frequent words in long sequences. 

To this day, there has been no exact characterization of genomic spectra, aside 
from the observation of power-law behavior for certain word sizes |0 |9j [10) in 
the right-hand tail. This note aims to point out that a parametric distribution de- 
scribes word frequencies extremely well in prokaryotic and eukaryotic genome 
sequences. The distribution in question is the so-called double Pareto-lognormal 
(DPL) distribution [ 15 1. It fits many real-life size distributions, including that of 
wealth in society, human settlement sizes, and file sizes on the Internet. The dis- 
tribution has four parameters: a > 0, (3 > 0, v and r > 0; it has a power-law 
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Figure 2: Spectra in smaller genomes. For each full genome (concatenated chro- 
mosome sequences if necessary), the £-mer frequency distribution is shown by 
dots, and the fitted DPL distribution by a solid line. Notice the power law in the 
lower tail of the B. subtilis genome spectrum. 



(Pareto) tail to the left and to the right, with slopes characterized by the parame- 
ters (3 and a, respectively; in the middle, its shape is dominated by a lognormal 
distribution with parameters v and r. 

Figure |2] illustrates that a single DPL distribution's four parameters can be 
adjusted to describe spectra across hundreds or thousands of word frequencies 
in non-vertebrate genomes. (More examples for different spectra are shown on 
the web site pittp : / /www . iro . umontreal . ca/"csuros/spectrum/| ) 
For very short words with respect to the genome length L (about 8 < t < 
(log 4 L) — 4), the spectrum has mainly a lognormal shape. As £ increases, the 
mode of the lognormal component shifts downwards and the lower power-law 
tail becomes more and more discernible, followed by the appearance of the up- 
per power-law tail. For long words (from around £ > (log 4 L) + 2), the upper 
power-law tail dominates the spectrum. Given that compact genomes are mostly 
composed of coding sequences, the upper power-law tail is another manifestation 
of the power law for protein domain occurrences JSJ. 

In organisms with strong dinucleotide bias, such as for CpG in vertebrates or 
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Figure 3: Spectra of CpG-free £-mers on repeat-masked human chromosome 12. 

for ApT in honey bee 0, the spectrum can be decomposed into multiple DPL 
distributions by dinucleotide content (Figure |4|). Spectra for words without CpG 
dinucleotides in repeat-masked vertebrate sequences have a marked DPL shape 
(Fig-EJ- We illustrate the analysis of a large genomic spectrum with the example 
of human chromosome 12, which is typical of the human genome with respect to 
repeat element distribution and cytosine-guanine content ED. Abundant repeat 
elements may cause deviations from the DPL distribution, which may be the basis 
of their identification using a DPL null model, but often they are absorbed in the 
fundamental DPL curve (FigureEJ). Table [l] analyzes the composition of the spec- 
trum's tail in human chromosome 12. The contribution of non-repeat sequences 
in the tail decreases when moving toward higher word frequencies, but it levels 
off at about 25%. 

The power-law tail of gene family size distributions [8| can be explained by 
birth and death processes j5j[T4l. Similar arguments apply to genomic spectra. 
Consider the occurrences of a particular word along the genome as a "population." 
The population size is affected by mutational events, including duplications, inser- 
tions, deletions and point mutations. The population can increase by any copying 
mechanism, including segmental duplication and retrotranscription. The popula- 
tion decreases if an occurrence is destroyed by some mutation. Point mutations 
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Figure 4: 9-mer frequencies in non-repeat annotated regions of human chromo- 
some 12. The spectrum can be decomposed by CpG content into a few DPL 
distributions, shown by solid lines. 
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sequence (b) 


non-repeat (c) 


SINE (d) 


LINE (e) 


LTR (f) 


other (g) 


12 


19.61 


69.6 


47.2 


16.7 


21.4 


8.4 


6.3 


25 


5.34 


39.7 


41.5 


21.3 


22.6 


7.6 


7.0 


30 


3.67 


33.9 


39.6 


23.1 


22.9 


7.3 


7.2 


50 


1.30 


22.6 


34.0 


28.8 


23.2 


6.5 


7.5 


60 


0.91 


19.9 


32.1 


31.0 


23.1 


6.2 


7.5 


100 


0.37 


14.7 


27.9 


36.7 


22.6 


5.4 


7.4 


120 


0.28 


13.4 


26.8 


38.5 


22.3 


5.1 


7.3 


200 


0.14 


10.6 


24.6 


43.0 


20.7 


4.7 


7.0 


240 


0.10 


9.7 


24.2 


44.9 


19.4 


4.6 


6.9 



Table 1: Composition of the 12-mer spectrum's tail in human chromosome 12. 
Column (a) gives the fraction of words that occur at least n times; column (b) 
lists the fraction of the genome sequence covered by such words. Columns (c- 
g) list the fraction of those word occurrences within non-repeat regions, short 
interspersed elements, long interspersed elements, long terminal repeats, and other 
repeat elements (including DNA transposons, simple repeats, low-complexity and 
tandem repeats), respectively. Fractions are expressed as percentages. 



5 



words 




1 10 100 1000 10000 



Figure 5: Decomposing the contribution of different repeat families in the spec- 
trum of CpG-free 12-mers along human chromosome 12. Most of the deviation 
from DPL in the right-hand tail is caused by Alu and LI families and simple re- 
peats. 
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can create new populations, but so can insertion (at the insertion boundaries) and 
deletion (by bringing two halves of a word together). It is thus conceivable that 
a birth and death model appropriately models the spectrum's evolution. In order 
to illustrate this point, we carried out a simulation experiment in which a DNA 
sequence evolved solely by copying. We iteratively expanded an initial random 
DNA sequence by selecting a contiguous piece of a fixed length m in each iter- 
ation, and copying it back into the sequence at a random position. The resulting 
sequence exhibits the same heavy tail as the real-life sequences do (Figure[T]). The 
parameters of the fitted DPL distribution are affected by the copy size m, and fur- 
ther vary with the introduction of a point mutation process. It is thus likely that 
the DPL distribution of genomic spectra is yet another sign of ceaseless "tinker- 
ing" within the genome. 

Heavy-tail distributions are signs of self- similarity, or long-range autocorre- 
lation, which has been observed before in DNA sequences 1 12 1, and studied 
thoroughly in telecommunications engineering ilTHIl . Long-range autocorrelation 
at the single nucleotide level can result from so-called expansion-randomization 
processes ifTTIl. which model sequence evolution by deletions, mutations and du- 
plications. 

The birth and death model implies that some words occur often simply by 
chance, and not because of their functionality. Words that are abundant at an early 
point of evolution tend to preserve their relative abundance in the course of ran- 
dom copy events, in accordance with the principle of "rich get richer" underlying 
power-law distributions. Therefore, even the high frequency of a particular word 
across many related species does not imply functionality on its own, as the word 
may have been frequent by chance in a common ancestor already. Association 
with genes is not necessarily a sign of functionality either, since transcribed re- 
gions that harbor a frequent word can help its propagation throughout the genome 
by retrotranscription. 

Our investigations show that the distribution of word frequencies can be well 
approximated by a parametric model, the double Pareto-lognormal distribution. 
Such a distribution may result from a long history of evolutionary tinkering: copy- 
ing, rearranging, deleting, and changing different parts of the genome. The heavy 
tail of word frequency distributions means that findings of frequent motifs need to 
be assessed with extreme care, especially if their overrepresentation is related to 
word occurrences in random texts. 
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Methods 



Words were counted only on one strand of the DNA sequences (the 'plus' strand of the 
sequence file), with the exception of the 16-mers in the human genome, where both strands 
were scanned. We counted the occurrence of a word w if it appeared in a given sequence 
at some position + 1 — 1, without ambiguous nucleotides. The DPL distribution was 
fitted using its cumulative distribution function (cdf), which is 

\n.x — v\ a B _ Bv+ a2 r 2/ 2 ,/ \n x — v -\- (3t 2 
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for x > and F(x) = for x < 0, where $(•) denotes the cdf of the standard normal 
distribution. The spectrum consists of the numbers W(n) of £-mers occurring exactly n 
times for all n = 0, 1, 2, . . . In order to fit the distribution's parameters, the spectrum 
(w(n) : n = 0, 1 . . . J was considered as a set of binned values for independently drawn 
samples from a continuous DPL distribution: W[n) was compared to the predicted value 
4?(^F(n + ^) — F(n — |)J. We used custom-made programs to carry out the parameter 
fitting, using the Levenberg-Marquardt algorithm [ 13], a nonlinear least-squares method, 
for which the starting parameter values were set by likelihood maximization (T5\ . 

We defined CpG content of a word w (Fig. as the number of non-overlapping 
CG and GC dinucleotides in w. The contribution of different annotations (Fig. |5jl were 
computed by multiplying each W{n) value in the spectrum by the fraction of occurrences 
within the annotated regions for words appearing n times in the entire sequence. For 
the random shuffling of Fig. ^ we partitioned the sequence into contiguous segments 
containing exactly 1000 non-ambiguous nucleotides. Non-ambiguous nucleotides were 
garbled in each segment by generating a uniform random permutation. 

Human sequences (original and repeat-masked) and repeat annotations (Figures ^ 
13141 and l5b were obtained from the UCSC genome browser [3| gateway's FTP server 
( |ftp : //hqdownload. cse . ucsc . edu/D , for version hgl8 (NCBI Build 36.1). (The 
repeat annotations were generated by the programs RepeatMasker [ 20 ] and Tandem Re- 
peats Finder [ 1 ].) Other sequences (Fig.|2j were downloaded from the NCBI FTP server 
( |ftp://ftp.ncbi.nlm.nih. gov/ genome s/\ . 
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